Week 8: Machine Learning 2
SECU0057
Bennett Kleinberg
5 Mar 2020
Today
- unsupervised learning
- core algorithm in detail
- problems of unsupervised learning
Problem for supervised approaches
- most of the time we don’t have labelled data
- sometimes there are no labels at all
- core idea: finding clusters in the data
Examples
- grouping of online ads
- clusters in crime descriptions
- collections of texts without authors
Practically all interesting problems are unlabelled data problems.
The unsupervised case

Aim
- examining whether there are patterns (e.g. groups in the data)
- possibly: a ‘grouped’ underlying data generation process
- helpful because: reduces dimensions of the data
How to test whether there are patterns?
- separate data into a set number of clusters
- find the best cluster assignment of observations
Common method: k-means algorithm
1. Setting k
Let’s take \(k=4\).
unsup_model_1 = kmeans(data4,
                       centers = 4,
                       nstart = 10,
                       iter.max = 10)
Assigning cluster membership

Iterative algorithm

What happened in the iterations?

The k-means algorithm in detail
- set random centroids in n-dimensional space
- assign each observation to its closest centroid
- find new centroids
- re-assign the observations
- (iterative approach)
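The steps above can be sketched in a few lines of base R. This is an illustration only (in practice you would use the built-in kmeans() function); the toy data and the fixed starting centroids are assumptions made here so the example runs reproducibly.

```r
# Minimal k-means sketch following the steps above (illustration only;
# use the built-in kmeans() in practice). Toy data: two separated groups.
set.seed(1)
pts = rbind(matrix(rnorm(20, mean = 0), ncol = 2),
            matrix(rnorm(20, mean = 4), ncol = 2))
k = 2
# step 1: set centroids (random in the real algorithm; fixed here so the
# example is reproducible)
centroids = pts[c(1, 11), , drop = FALSE]
for(iteration in 1:10){
  # step 2: assign each observation to its closest centroid (Euclidean)
  d = as.matrix(dist(rbind(centroids, pts)))[-(1:k), 1:k]
  assignment = apply(d, 1, which.min)
  # step 3: find new centroids = the mean of each cluster's points
  new_centroids = centroids
  for(j in 1:k){
    new_centroids[j, ] = colMeans(pts[assignment == j, , drop = FALSE])
  }
  # stopping rule: convergence (centroids no longer move)
  if(all(abs(new_centroids - centroids) < 1e-8)) break
  centroids = new_centroids
}
table(assignment)
```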
Assigning cluster membership

Obtaining distances (errors)

Distance metric
- typically: Euclidean distance
- \(dist(p, c) = \sqrt{(p_1 - c_1)^2 + (p_2 - c_2)^2}\)
\(dist(p = [1, 1], c = [2, 3]) = \sqrt{(1 - 2)^2 + (1 - 3)^2} = \sqrt{5} \approx 2.24\)
Objective: \(\arg \min D(p_i, c_j)\)
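The worked example above can be checked directly in R (variable names here are just illustrative):

```r
# Euclidean distance between point p = [1, 1] and centroid c = [2, 3]
p = c(1, 1)
centroid = c(2, 3)  # not named 'c', to avoid masking R's c() function
d = sqrt(sum((p - centroid)^2))
round(d, 2)  # 2.24
```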
After distance-based assignment

New centroids: k-MEANS
| x | y | cluster |
|---|---|---------|
| 1 | 1 | red |
| 1 | 3 | red |
| 4 | 1 | blue |
| 5 | 5 | blue |
\(Mx_{red} = \frac {1+1}{2} = 1\)
\(My_{red} = \frac {1+3}{2} = 2\)
\(M_{red} = [1, 2]\)
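The centroid update for the table above can be reproduced in R; the new centroid of each cluster is simply the mean of its members' coordinates:

```r
# the four points from the table, with their current cluster assignment
pts = data.frame(x = c(1, 1, 4, 5),
                 y = c(1, 3, 1, 5),
                 cluster = c("red", "red", "blue", "blue"))
# new centroid of each cluster = the mean of its members' coordinates
centroids = aggregate(cbind(x, y) ~ cluster, data = pts, FUN = mean)
centroids  # red centroid is [1, 2], matching the calculation above
```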
New centroids

Iteration after iteration

Cluster membership after iteration 2

Stopping rule
If any of these apply:
- convergence (i.e. no points change cluster membership)
- max. number of iterations (iter.max = ...)
- distance threshold reached
What’s strange about our approach?
How do we know k?
Possible approach:
- run it for \(n\) combinations: \(k=1, k=2, ... k=n\)
- assess how good k is
What does “good” mean?
Determining k
WSS = within (cluster) sum of squares
- take the difference between each point \(x_{i,j}\) in cluster \(j\) and its cluster centre \(c_j\)
- remember: \(c_j\) is now the mean of all points \(x_{i,j}\) in cluster \(j\)
- so: we square the differences and sum them
\(\arg \min \sum\limits_{j} \sum\limits_{i} (x_{i,j} - c_j)^2\)
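The WSS from this formula is exactly what kmeans() returns as tot.withinss, which can be verified by hand. Since data4 itself is not shown on the slides, simulated stand-in data is used here:

```r
# The WSS from the formula above equals kmeans()'s tot.withinss.
# Simulated stand-in data, since data4 itself is not shown on the slides.
set.seed(123)
sim_data = matrix(rnorm(200), ncol = 2)
m = kmeans(sim_data, centers = 3, nstart = 10)
# squared distance of each point to the centre of its own cluster, summed
wss_manual = sum((sim_data - m$centers[m$cluster, ])^2)
all.equal(wss_manual, m$tot.withinss)  # TRUE
```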
Cluster determination
wss = numeric()
for(i in 1:20){
  kmeans_model = kmeans(data4, centers = i, iter.max = 20, nstart = 10)
  wss[i] = kmeans_model$tot.withinss
}
For \(k=1 .. k=20\)
## [1] 1998.00000 799.23145 529.42464 405.14898 341.16308 293.44305
## [7] 256.25549 226.13568 201.62530 181.03906 163.43303 152.20691
## [13] 143.17168 133.78717 124.50437 117.49929 111.04724 102.77820
## [19] 97.30524 91.73814
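An elbow plot of WSS against k can then be drawn with base R. The snippet below recomputes the wss vector on simulated stand-in data (data4 is not shown on the slides) so that it runs standalone:

```r
# Scree/elbow plot on simulated stand-in data (data4 not shown)
set.seed(123)
sim_data = matrix(rnorm(200), ncol = 2)
wss = numeric()
for(i in 1:20){
  wss[i] = kmeans(sim_data, centers = i, iter.max = 20, nstart = 10)$tot.withinss
}
plot(1:20, wss, type = "b", xlab = "k (number of clusters)",
     ylab = "within-cluster sum of squares")
```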
Scree plot (= the elbow method)

Other methods to establish k
- Silhouette method (cluster fit)
- Gap statistic
Silhouette method

Gap statistic

Applying k-means clustering
We settle for \(k = 2\)
unsup_model_final = kmeans(data4,
                           centers = 2,
                           nstart = 10,
                           iter.max = 10)
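The cluster assignment can then be plotted in base R. The simulated two-group salary/height data below is an assumption standing in for data4, which is not shown on the slides:

```r
# Sketch of the cluster plot with simulated stand-in data (data4 not shown):
# two groups on standardised salary/height, as in the slides
set.seed(123)
sim_data4 = data.frame(salary = c(rnorm(50, -0.75), rnorm(50, 0.8)),
                       height = c(rnorm(50, -0.75), rnorm(50, 0.8)))
m = kmeans(sim_data4, centers = 2, nstart = 10, iter.max = 10)
plot(sim_data4, col = m$cluster, pch = 19,
     main = "k-means cluster assignment (k = 2)")
points(m$centers, pch = 4, cex = 2, lwd = 2)  # mark the cluster centres
```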
Plot the cluster assignment

Other unsupervised methods
- k-means (today)
- hierarchical clustering
- density clustering
Issues with unsupervised learning
What’s lacking?
What can you (not) say?
Caveats of unsup. ML
- there is no “ground truth”
- interpretation/subjectivity
- cluster choice
Interpretation of findings

Interpretation of findings
unsup_model_final$centers
## salary height
## 1 -0.7474895 -0.7551138
## 2 0.7937260 0.8018218
- Cluster 1: lower salary, shorter
- Cluster 2: higher salary, taller
- People in cluster 1 earn less and are shorter than those in cluster 2
We cannot say more than that!
Interpretation of findings

Interpretation of findings
- subjective
- labelling tricky
- researcher's choice!
- be open about this
Cluster choice
What if we chose \(k=3\)?

When k changes, the interpretation changes
## salary height
## 1 -1.1253285 -0.7403048
## 2 0.7959880 1.1611042
## 3 0.4627853 -0.4561074
Interpretation for k=3
- Cluster 1: very low salary, small
- Cluster 2: high salary, very tall
- Cluster 3: avg-to-high salary, small
Cluster choice
- be open about it
- make all choices transparent
- always share code and data ("least vulnerable" principle)
Important
Note: we cannot say anything about accuracy.
See the k-NN model.
Bigger picture of machine learning
- covered so far: supervised + unsupervised learning
- next week: neural networks
How do supervised and unsupervised learning relate to each other?
Case example
- suppose you want to measure hate speech in the UK
- on Twitter
- and you have 10m Tweets of interest
Possible approach
- you craft rules to determine hate speech vs non-hate speech
- problematic: might not capture all dynamics + costly
Better: supervised machine learning (text classification)
Text classification approach
- you annotate some data (typically crowdsourced)
- you build a supervised learning model
- with proper train-test splitting
- and assess the model with \(Pr_{hatespeech}\)
Suppose you have a good enough model.
Remember
- the aim was to measure hate speech in the UK
- your model should now be good to annotate unlabelled data
- i.e. you can use the model on all Tweets
- and then answer the RQ
What’s next?
- Today's tutorial + homework: unsupervised learning in R
Next week: Machine Learning 3